Cox regression modeling of survival after chemotherapy for colon cancer
I. Introduction
The Cox Proportional Hazards (PH) model is one of the most widely used regression models to study survival analysis, also known as time-to-event analysis. The model investigates the association between survival time and one or more predictors. The Cox PH model is based on the hazard function, which is defined as the probability that an individual will experience the event at a given time, provided that the event has not occurred yet. The outcome variable is the survival time, which is measured from a defined time until the occurrence of an event, such as death, or the end of the study period [1]. For each subject, there is time to event (T), i.e., the time from a defined time until the event occurs, or censoring time (C), i.e., the time the subject drops out of the study or the study ends.
The Cox PH model relies on two assumptions: 1) random censoring, and 2) the proportional hazard assumption [2]. The random censoring assumption is met if patients who are censored are a random sample from the study population. There is no statistical test for this assumption, and it can be achieved through rigorous experimental design. The proportional hazard assumption states that the hazard function (hazard ratio) for two groups (e.g., experimental arm and standard arm) should remain proportional, i.e., the hazard ratio is constant over time. However, this assumption is violated when the effect of a covariate on the outcome is not constant over the follow-up period, which is common in biochemical research. When the proportional hazard assumption is not met, several methods can be applied to address this issue, including stratification of the model by variables that violate the assumption. This allows the baseline hazard to vary across strata [3]. Another approach is to include interactions between covariates and time, allowing the hazard ratios to change over time [4]. While the baseline hazards are allowed to differ, the effects of other covariates are assumed to be the same across strata. One limitation with this approach is that the effect of the stratification variable itself cannot be tested because it is not included as a covariate in the model. Other statistical methods to account for non-proportional hazards have been described and are discussed elsewhere [5].
One of the properties of the Cox PH model contributing to its popularity is that the baseline hazard, i.e., \(h_0\), is an unspecified field, which makes it a semi-parametric model. It is robust and can provide reliable estimates without specifying the baseline hazard function. Even though the baseline hazard is not known, it is possible to estimate the coefficients in the exponential part of the equation. Hence, the effect of variables included in the model can be estimated [6]. Another feature of the Cox PH model is its interpretability. The effect of covariates on the hazard is represented by hazard ratios, i.e., the relative likelihood of an event happening in the experimental group with respect to the standard group.
The Cox PH model is widely used across various fields, including medical research, epidemiology, business, engineering, and social sciences [6]. It is commonly used to determine survival rates among cancer patients with different subtypes of cancer [7], or between those treated with different treatments [8]. The Cox PH model has also been used effectively in non-medical research, for example, in the prediction of time to loan defaults [9], customer time to churn [10], product lifespan and failure time analysis in engineering [11], and for the identification of factors that influence the time it takes to achieve certain rates of success, such as vaccination rates, in public health [12].
The current analysis evaluated data from one of the first successful trials of adjuvant chemotherapy for colon cancer. The colon cancer patients in the study had their primary treatment of surgery and the objective was to test whether treatment with either Levamisole or Levamisole in combination with 5-FU chemotherapeutic agents improves survival. Levamisole is a low-toxicity compound previously used to treat worm infestations in animals; 5-FU is a moderately toxic chemotherapy agent. There are two records per person, one for recurrence and one for death. The purpose of this project is to compare survival between the untreated (Obs) group vs those treated with amisole (Lev), or amisole + 5-FU.
II. Methods
The Cox proportional hazards model was used to model the relationship between survival time and different colon cancer treatments. In particular, the survival time between the untreated group (observation) and those treated with amisole (Lev) or amisole + 5-FU was compared. The Cox regression model was chosen for this study because it is well-suited for studying the association between survival time of patients and predictors, and it allows estimating the risk or hazard of death increased or decreased due to each treatment relative to no treatment. The time (in days) until the event, i.e., death, will be modeled as a function of treatment and other variables, including age, sex, and various tumor characteristics. Significant predictors were included in the final model.
IIa. Data Source
The colon cancer cancer data set is a built-in data set in the Survival R package [13] [14].Data set includes 929 subjects with stage B/C colon cancer who were randomized to three treatment groups, Observation, Levamisole (Lev), or Levamisole + 5-FU. Patients were followed for 5-years after randomization. The data set includes survival and recurrence information on 929 individuals, but for the current analysis, it was filtered to include only the rows where death was the outcome variable. The time to death is given in days. The dataset includes various patient characteristics, such as demographics and tumor details, as well as the duration from surgery to trial registration, categorized as either short or long.
Column names:
| id: | id |
| study: | 1 for all patients |
| rx: | Treatment - Obs(ervation), Lev(amisole), Lev(amisole)+5-FU |
| sex: | 1=male |
| age: | in years |
| obstruct: | obstruction of colon by tumour |
| perfor: | perforation of colon |
| adhere: | adherence to nearby organs |
| nodes: | number of lymph nodes with detectable cancer |
| time: | days until event or censoring |
| status: | censoring status |
| differ: | differentiation of tumour (1=well, 2=moderate, 3=poor) |
| extent: | Extent of local spread (1=submucosa, 2=muscle, 3=serosa, 4=contiguous structures) |
| surg: | time from surgery to registration (0=short, 1=long) |
| node4: | more than 4 positive lymph nodes |
| etype: | event type: 1=recurrence,2=death |
IIb. Exploratory data analysis
Multicolinearity was tested using Variant Inflation Factor (VIF) calculated using MASS package [15].The Survminer [16] package was used to plot the Kaplan-Meier curve to visualize the survival probability over time for each of the categorical variables. The data was evaluated for missing values, duplicate entry, and outliers. Multicolinearity was tested using Variant Inflation Factor (VIF) calculated using MASS package [15].
IIc. Statistical analysis
The R statistical software version 4.3.2 [17] was used for all analysis. The Survival package was used to construct the Cox regression model [13] [14].
Cox regression model is based on the hazard function \(h_x(t)\) with covariates at time t given by:
\(h_x(t)=h_0(t)\exp(\beta_1x_1 +\beta_2x_2 + \dots + \beta_p x_p)\)
Where:
\(h_x(t)\) is the hazard function
\(h_0(t)\) is the baseline hazard function
\(\beta_1x_1 + \beta_2x_2 + \dots +\beta_p x_p\) represent the linear combination of covariates and their coefficient
The hazards for the observation vs. amisole (Lev), or amisole + 5-FU group with covariate values x1 and x2 are given by: \(hx_1(t)=h_0(t)\exp(\beta_1x_1)\) and \(hx_2(t)=h_0(t)\exp(\beta_2x_2)\), respectively
The hazard ratio is expressed as: HR = \(hx_2(t)\) / \(hx_1(t)\) = \(\exp[\beta(x_2-x_1)]\)
The R MASS package was used for Step-wise variable selection, using “both” forward and backward variable selection [15]. For Step-wise selection, stepAIC() function uses AIC (Akaike Information Criterion) as the measure to add or remove predictors from the model.
IId. Model evaluation
Diagnostic for proportional hazard assumption
The Schoenfeld residual plot was constructed to test Cox proportional hazards assumption. When the proportional hazards assumpiton was not met for any of the covariates, stratification approach was used.
K-fold cross-validation
Model performance was evaluated using 5-fold cross-validation using the boot package [18] [19].
Model selection
The Concordance (c-index), AIC, and BIC of corresponding models were compared to select the best fit model. The model with the lowest AIC and BIC values and highest concordance was selected as the final model.
III. Analysis and Results
IIIa. Data cleaning and feature engineering
The data set contained missing values in number of nodes and differentiation columns. Missing values in differentiation and node columns were replaced with with mode and median, respectively. Numbers representing different categories in differentiation, local spread and time to surgery columns were replaced with the descriptive names. Evaluation of nodes and node4 showed that samples with greater than 4 lymph nodes have less than 5 count in nodes column, so the two columns are not consistent. Therefore, nodes column will not be used for further analysis.
IIIb. Exploratory data analysis
Figure 1: Exploratory data analysis
Exploratory data visualization shows (Figure 1.) The study participants were equally randomized in the three treatment groups, with about 33% of individuals in observation, Levamisole (Lev) or Lev + 5-FU groups. Fifty-two percent of the participants were male. The dataset is balanced with respect to the outcome (status), with 51% censored and 49% died bevore the end of follow-up period.
With respect to tumor characteristics, obstruction of colon by tumor (obstruct), perforation of colon (perfor), tumor adherence to nearby organs (adhere) was observed in 19%, 3% and 15% of the patients, respectively. About 27% of the participants have more than 4 tumor positive lymph nodes (node 4) and long time from surgery to registration (surg). Overall, most of the categories of tumor characteristics are well represented, with about 15% of the participants with those characteristics, except for perforation, differentiation and local spread.
Age is normally distributed. Evaluation of time to death or censoring showed a bimodal distribution. Stratifying the time variable by outcome (status), revealed that the majority of the patients died within 3 years of the study period.
Table 1: Patient characteristics
| Observation (%) | Amisole (%) | Amisole + 5-FU (%) | ||
|---|---|---|---|---|
| N=315 | N=310 | N=304 | ||
| Demographics | ||||
| Male | 166 (52.3) | 177 (57.1) | 141 | |
| Median age (years) [IQR] | 60 [53,68] | 61 [53,69] | 61 [52,70] | |
| Cancer characteristics | ||||
| Colon obstruction | 63 (20.0) | 63 (20.3) | 54 (17.8) | |
| Colon perforation | 9 (2.9) | 10 (3.2) | 8 (2.6) | |
| Adherence to nearby organs | 47 (14.9) | 49 (15.8) | 39 (12.8) | |
| Differentiation of tumor | ||||
| Well | 27 (8.6) | 37 (11.9) | 29 (9.5) | |
| Moderate | 236 (74.9) | 229 (73.9) | 221 (72.7) | |
| Poor | 52 (16.5) | 44 (14.2) | 54 (17.8) | |
| Extent of local spread | ||||
| Contiguous | 20 (6.3) | 12 (3.9) | 11 (3.6) | |
| Muscle | 38 (12.1) | 36 (11.6) | 32 (10.5) | |
| Serosa | 249 (79.0) | 259 (83.5) | 251 (82.6) | |
| Submucosa | 8 (2.5) | 3 (1.0) | 10 (3.3) | |
| More than 4 lymph nodes with cancer | Yes | 87 (27.6) | 89 (28.7) | 79 (26.0) |
| Short time from surgery to registration (%) | Yes | 91 (28.9) | 80 (25.8) | 76 (25.0) |
As shown in Table 1, the baseline patient characteristics were very similar between the three treatment groups, indicating suggesting that and differences in survival time between the three groups is not likely attributed to differences at the start of the study.
IIIc. Survival curves
#| echo: false
#| message: false
#| warning: false
#| include: true
colon_surv$idcount <- c(1:length(colon_surv$id))
# Order by survival time and create an order variable:
colon_surv <- colon_surv[order(-colon_surv$time, colon_surv$status),]
colon_surv$order <- c(1:length(colon_surv$idcount))
# Create a data frame for vline intercepts
vline_data <- data.frame(
status = unique(colon_surv$status),
vline = c(2331, 775) # Example values for each status
)
# Create a data frame for text labels
label_data <- data.frame(
status = unique(colon_surv$status),
x = 2600, # X position for the label
y = 750, # Y position for the label
label = "Median Survival Time"
)
ggplot(data=colon_surv, aes(x=time, y=order)) +
geom_rect(xmin=23, xmax=colon_surv$time, ymin=colon_surv$order, ymax=colon_surv$order+1, colour="lightblue") +
geom_rect(xmin=colon_surv$time-2, xmax=colon_surv$time, ymin=colon_surv$order, ymax=colon_surv$order+1, fill=factor(colon_surv$status+1)) +
geom_vline(data=vline_data, aes(xintercept=vline), linetype="solid") + # Add vertical lines with different values
geom_text(data=label_data, aes(x=x, y=y, label=label)) + # Add text labels
scale_x_continuous(breaks=seq(20, 3330, 650)) +
xlab("Survival Time (Days)") +
ylab("Participants (ordered by survival time)") +
ggtitle("Survival Times for Participant") +
facet_wrap(~status) +
theme_classic() +
theme(
legend.position="none",
panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),
panel.background=element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black")
)Figure 2: Censoring (left) or survival (right) time of patients
The median survival time of patients who were censored (left) or died (right) was 2331 and 775 days, respectively. This results shows that patients who were not censored died early during the follow-up period.
Stratified survival curves
#| echo: false
#| message: false
#| warning: false
#| include: true
f1 <- survfit(Surv(time, status) ~ adhere, data = df)
f2 <- survfit(Surv(time, status) ~ rx, data = df)
f3 <- survfit(Surv(time, status) ~ sex, data = df)
f4 <- survfit(Surv(time, status) ~ obstruct, data = df)
f5 <- survfit(Surv(time, status) ~ perfor, data = df)
f6 <- survfit(Surv(time, status) ~ node4, data = df)
f7 <- survfit(Surv(time, status) ~ differentiation, data = df)
f8 <- survfit(Surv(time, status) ~ local_spread, data = df)
f9 <- survfit(Surv(time, status) ~ surg, data = df)
fits <- list(adhere = f1, rx = f2, sex =f3, obstruct = f4 , perfor=f5 , node4 =f6 , differentiation =f7 , local_spread =f8 , surg= f9)
# Visualize
legend.title <- list("adhere", "rx", "sex", "obstruct", "perfor", "node4", "differentiation", "local_spread", "surg")
ggsurvplot_list(fits, df, pval=TRUE) #legend.title = legend.title,$adhere
$rx
$sex
$obstruct
$perfor
$node4
$differentiation
$local_spread
$surg
attr(,"class")
[1] "list" "ggsurvplot_list"
Figure 3: Survival curves stratified by categorical variables.
There is a statistically significant difference in survival times between two or more groups being compared as shown in the survival curves stratified by sex, obstruct, adherence to nearby organs, presense of four or mor tumor positive nodes, differentiation, local spread and surgery. Survival time is not significantly different between male and female or among those with or without tumor perforation. Patient characterstics that showed significant differences in survival time between two categories are important predicotrs of survival time. However, The kaplan Mier’s curve does not adjust for other variables, therefore, some of the predictors may not be significant when included in the multivarible Cox regression model.
IIId. Cox regression models
Model_1: Base Model
Model_1 <- coxph(Surv(time, status) ~ 1, data = df)
c_index_model_1 <- concordance(Model_1)
cat("Concordance of the base model:",c_index_model_1$concordance)Concordance of the base model: 0.5
Model_2: Univariate analysis
# Univariate analysis
Model_2 <- coxph(Surv(time, status) ~ rx, data = df)
# Create the regression table and add concordance statistic
summary_table <- tbl_regression(Model_2, exponentiate = TRUE) %>%
add_glance_source_note(
label = list(concordance = "Concordance"),
include = c("concordance")
) %>%
modify_table_styling(
columns = p.value,
rows = p.value < 0.05,
text_format = "bold"
)
# Convert to gt table, increase font size, and adjust width
gt_table <- as_gt(summary_table) %>%
gt::tab_options(
table.font.size = px(17), # Increase font size
table.align = "center", # Align table to the left
table.width = pct(50) # Make the table wider
)
# Print the gt table
gt_table| Characteristic | HR1 | 95% CI1 | p-value |
|---|---|---|---|
| rx | |||
| Obs | — | — | |
| Lev | 0.97 | 0.78, 1.21 | 0.8 |
| Lev+5FU | 0.69 | 0.55, 0.87 | 0.002 |
| Concordance = 0.536 | |||
| 1 HR = Hazard Ratio, CI = Confidence Interval | |||
The coefficient of Lev is not significant, suggesting that there is no evidence that this treatment affects survival time compared to observation. however Lev+5Fu is significant (p=0.00175), indicating that the treatment Lev +5Fu has a statistically significant effect on survival time compared to the reference group. The negative sign indicates that this treatment group has a lower hazard and likely a longer survival time. The hazard ratio for Lex+5FU (0.690), indicating the risk of death is about 31% lower compared to the observation group.
Model_2: Cox proportional hazard assumption test
zph_test <- cox.zph(Model_2)
# Convert the Schoenfeld residuals test results to a data frame
zph_df <- as.data.frame(zph_test$table)
zph_df$Variable <- rownames(zph_df)
zph_df <- zph_df %>%
mutate(
chisq = round(chisq, 3),
p = round(p, 3)
)
zph_df %>%
kbl(caption = "Schoenfeld Residuals Test Results") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center") %>%
row_spec(0, bold = TRUE, align = "center")| chisq | df | p | Variable | |
|---|---|---|---|---|
| rx | 1.481 | 2 | 0.477 | rx |
| GLOBAL | 1.481 | 2 | 0.477 | GLOBAL |
# plot the Schoenfeld residuals
plot(zph_test)The Schoenfeld residal plot shows that the residuals are scattered randomly and the smooth trend line is horizontal near 0. This suggests that the hazard ratio for rx (treatment status) is constant over time and the proportional hazard assumption is met. The global p-value is >0.05, indicating that the the assumption is met.
Model_3: All predictors
# Full variables: All predictors
Model_3<- coxph(Surv(time, status) ~ rx+ age + sex + perfor + adhere + surg + obstruct + differentiation + node4+ local_spread, data = df)
summary_table <- tbl_regression(Model_3, exponentiate = TRUE) %>%
add_glance_source_note(
label = list(concordance = "Concordance"),
include = c("concordance")
) %>%
modify_table_styling(
columns = p.value,
rows = p.value < 0.05,
text_format = "bold"
)
gt_table <- as_gt(summary_table) %>%
tab_options(
table.font.size = px(17), # Reduce font size
table.align = "center", # Align table to the center
table.width = pct(50), # Make the table wider
data_row.padding = px(2) # Reduce row padding
)
gt_table| Characteristic | HR1 | 95% CI1 | p-value |
|---|---|---|---|
| rx | |||
| Obs | — | — | |
| Lev | 0.98 | 0.79, 1.22 | 0.9 |
| Lev+5FU | 0.69 | 0.54, 0.87 | 0.002 |
| age | 1.01 | 1.00, 1.02 | 0.083 |
| sex | 1.04 | 0.86, 1.26 | 0.7 |
| perfor | 1.00 | 0.59, 1.70 | >0.9 |
| adhere | 1.18 | 0.92, 1.53 | 0.2 |
| surg | 1.27 | 1.03, 1.55 | 0.022 |
| obstruct | 1.33 | 1.06, 1.68 | 0.015 |
| differentiation | |||
| moderate | — | — | |
| poor | 1.43 | 1.13, 1.82 | 0.003 |
| well | 1.08 | 0.78, 1.50 | 0.6 |
| node4 | 2.55 | 2.10, 3.09 | <0.001 |
| local_spread | |||
| contiguous | — | — | |
| muscle | 0.39 | 0.23, 0.64 | <0.001 |
| serosa | 0.64 | 0.43, 0.94 | 0.023 |
| submucosa | 0.29 | 0.10, 0.83 | 0.021 |
| Concordance = 0.674 | |||
| 1 HR = Hazard Ratio, CI = Confidence Interval | |||
Model_3 summary shows that, when all variables are included in the model; age, sex, perforation and adherance are not significant predictors. wheras, rx, surg, obstruct, differentiation, node4 and local spread are significant predictors. The concordance of the multivariable model, 0.674, is higher than the univariate model (concordance =0.53), suggesting that the multivariate model is a better fit model. However, the model concordance did not improve when removing predictors that were not significant in Model_3.
Stepwise variable selection:
Stepwise variable selection approach was used to confirm that the significant predictors in Model_3 are the best set of predictors. The MASS package stepAIC() function was used for stepwise variable selection by using AIC (Akaike Information Criterion) as the measure to add or remove predictors from the model.
# Model_2 includes all variables
Model_2 <- coxph(Surv(time, status) ~ rx+ age + sex + perfor + adhere + surg + obstruct + differentiation + node4+
local_spread, data = df)
# stepwise selection
stepwise_model <- stepAIC(Model_2, direction = "both")Start: AIC=5741.4
Surv(time, status) ~ rx + age + sex + perfor + adhere + surg +
obstruct + differentiation + node4 + local_spread
Df AIC
- perfor 1 5739.4
- sex 1 5739.6
- adhere 1 5741.0
<none> 5741.4
- age 1 5742.5
- surg 1 5744.5
- obstruct 1 5745.1
- differentiation 2 5745.4
- rx 2 5749.9
- local_spread 3 5753.1
- node4 1 5822.0
Step: AIC=5739.4
Surv(time, status) ~ rx + age + sex + adhere + surg + obstruct +
differentiation + node4 + local_spread
Df AIC
- sex 1 5737.6
- adhere 1 5739.1
<none> 5739.4
- age 1 5740.5
+ perfor 1 5741.4
- surg 1 5742.5
- obstruct 1 5743.2
- differentiation 2 5743.5
- rx 2 5747.9
- local_spread 3 5751.1
- node4 1 5820.1
Step: AIC=5737.59
Surv(time, status) ~ rx + age + adhere + surg + obstruct + differentiation +
node4 + local_spread
Df AIC
- adhere 1 5737.3
<none> 5737.6
- age 1 5738.6
+ sex 1 5739.4
+ perfor 1 5739.6
- surg 1 5740.7
- obstruct 1 5741.2
- differentiation 2 5741.7
- rx 2 5746.2
- local_spread 3 5749.3
- node4 1 5818.2
Step: AIC=5737.26
Surv(time, status) ~ rx + age + surg + obstruct + differentiation +
node4 + local_spread
Df AIC
<none> 5737.3
+ adhere 1 5737.6
- age 1 5738.6
+ sex 1 5739.1
+ perfor 1 5739.2
- surg 1 5740.7
- obstruct 1 5740.9
- differentiation 2 5742.1
- rx 2 5746.0
- local_spread 3 5750.4
- node4 1 5817.4
# Extract the coefficients of the selected model
selected_variables <- coef(stepwise_model)
# Print the selected variables
print(selected_variables) rxLev rxLev+5FU age
-0.010789186 -0.375746378 0.007365529
surg obstruct differentiationpoor
0.243870576 0.283164609 0.373782987
differentiationwell node4 local_spreadmuscle
0.069021620 0.929854405 -0.995556083
local_spreadserosa local_spreadsubmucosa
-0.501168855 -1.322018421
Based on a step-wise selection, rx, age, surg, obstruct, differentiation, node4 and local spread were selected for inclusion in the final model. Age is a significant predictor in the step-wise selected variables, but not in the full variable model (Model_2).
Model_4: Step-wise Selected variables
Model_4 <- coxph(Surv(time, status) ~ rx + age + surg + obstruct +
differentiation + node4 + local_spread, data = df)
summary_table <- tbl_regression(Model_4, exponentiate = TRUE) %>%
add_glance_source_note(
label = list(concordance = "Concordance"),
include = c("concordance")
) %>%
modify_table_styling(
columns = p.value,
rows = p.value < 0.05,
text_format = "bold"
)
gt_table <- as_gt(summary_table) %>%
tab_options(
table.font.size = px(17), # Reduce font size
table.align = "center", # Align table to the center
table.width = pct(50), # Make the table wider
data_row.padding = px(2) # Reduce row padding
)
gt_table| Characteristic | HR1 | 95% CI1 | p-value |
|---|---|---|---|
| rx | |||
| Obs | — | — | |
| Lev | 0.99 | 0.80, 1.23 | >0.9 |
| Lev+5FU | 0.69 | 0.54, 0.87 | 0.002 |
| age | 1.01 | 1.00, 1.02 | 0.069 |
| surg | 1.28 | 1.04, 1.56 | 0.018 |
| obstruct | 1.33 | 1.06, 1.67 | 0.015 |
| differentiation | |||
| moderate | — | — | |
| poor | 1.45 | 1.15, 1.84 | 0.002 |
| well | 1.07 | 0.77, 1.48 | 0.7 |
| node4 | 2.53 | 2.09, 3.07 | <0.001 |
| local_spread | |||
| contiguous | — | — | |
| muscle | 0.37 | 0.23, 0.61 | <0.001 |
| serosa | 0.61 | 0.41, 0.89 | 0.010 |
| submucosa | 0.27 | 0.09, 0.76 | 0.014 |
| Concordance = 0.672 | |||
| 1 HR = Hazard Ratio, CI = Confidence Interval | |||
The concordance of Model_4 did not improve when compared to Model_3, which included all predictors, but the model is less complex since in includes a smaller number of variables (see model comparison table below)
Model_4: Cox proportional hazard assumption test
# final model with step wise variable selection
zph_test <- cox.zph(Model_4)
# Convert the Schoenfeld residuals test results to a data frame
zph_df <- as.data.frame(zph_test$table)
zph_df$Variable <- rownames(zph_df)
zph_df <- zph_df %>%
mutate(
chisq = round(chisq, 3),
p = round(p, 3)
)
zph_df %>%
kbl(caption = "Schoenfeld Residuals Test Results") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center") %>%
row_spec(0, bold = TRUE, align = "center")| chisq | df | p | Variable | |
|---|---|---|---|---|
| rx | 2.335 | 2 | 0.311 | rx |
| age | 0.549 | 1 | 0.459 | age |
| surg | 0.020 | 1 | 0.888 | surg |
| obstruct | 6.148 | 1 | 0.013 | obstruct |
| differentiation | 17.459 | 2 | 0.000 | differentiation |
| node4 | 5.662 | 1 | 0.017 | node4 |
| local_spread | 7.083 | 3 | 0.069 | local_spread |
| GLOBAL | 37.525 | 11 | 0.000 | GLOBAL |
Model_4 does not meet Cox poportional hazard assumption as shown by p-values less than 0.05 in the Schoenfeld residuals test. Differentiation, node4 and obstruct variables did not meet proportional hazards assumption, suggesting that effects of these variables are not constant over time. To overcome this, stratification approach will be used, where the model will be stratified by variables that did not meet proportional hazard.
Model_5: Stratified Model
| Characteristic | HR1 | 95% CI1 | p-value |
|---|---|---|---|
| rx | |||
| Obs | — | — | |
| Lev | 0.98 | 0.79, 1.22 | 0.9 |
| Lev+5FU | 0.71 | 0.56, 0.89 | 0.003 |
| age | 1.01 | 1.00, 1.02 | 0.034 |
| surg | 1.30 | 1.06, 1.59 | 0.012 |
| node4 | 2.50 | 2.06, 3.04 | <0.001 |
| local_spread | |||
| contiguous | — | — | |
| muscle | 0.34 | 0.21, 0.56 | <0.001 |
| serosa | 0.58 | 0.39, 0.84 | 0.004 |
| submucosa | 0.24 | 0.08, 0.67 | 0.007 |
| Concordance = 0.674 | |||
| 1 HR = Hazard Ratio, CI = Confidence Interval | |||
Model_5 Cox proportional hazard assumption test
zph_test <- cox.zph(Model_5)
zph_test chisq df p
rx 2.0007 2 0.368
age 0.6704 1 0.413
surg 0.0142 1 0.905
node4 4.2882 1 0.038
local_spread 5.2976 3 0.151
GLOBAL 12.4113 8 0.134
# Convert the Schoenfeld residuals test results to a data frame
zph_df <- as.data.frame(zph_test$table)
zph_df$Variable <- rownames(zph_df)
zph_df <- zph_df %>%
mutate(
chisq = round(chisq, 3),
p = round(p, 3)
)
zph_df %>%
kbl(caption = "Schoenfeld Residuals Test Results") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center") %>%
row_spec(0, bold = TRUE, align = "center")| chisq | df | p | Variable | |
|---|---|---|---|---|
| rx | 2.001 | 2 | 0.368 | rx |
| age | 0.670 | 1 | 0.413 | age |
| surg | 0.014 | 1 | 0.905 | surg |
| node4 | 4.288 | 1 | 0.038 | node4 |
| local_spread | 5.298 | 3 | 0.151 | local_spread |
| GLOBAL | 12.411 | 8 | 0.134 | GLOBAL |
# plot the Schoenfeld residuals
plot(zph_test)After model stratification by obstruct and differentiation, the proportional hazard assumption is met as the global p-value is >0.05. Node4 slightly violates assumption, but the final model is not stratified by node4 because the model concordance is attenuated when stratifying by node4.
Evaluate multicollinearity of variables in Model_5
vif <- vif(Model_5)
print(vif) GVIF Df GVIF^(1/(2*Df))
rx 1.008066 2 1.002010
age 1.019142 1 1.009525
surg 1.008430 1 1.004206
strata(obstruct) 3.071784 0 Inf
strata(differentiation) 3.071784 0 Inf
node4 1.023390 1 1.011628
local_spread 1.017073 3 1.002825
None of the variables has VIF >5, therefore multicollinearity is not a problem in Model_5
Model comparison
# Fit the models and store them in a list
models <- list(Model_1, Model_2, Model_3, Model_4, Model_5)
# Add descriptions for each model
descriptions <- c(
"Base model",
"Treatment",
"Full variables",
"Stepwise-selected variables",
"Stratified"
)
# Create a data frame to store results
results <- data.frame(
Model = character(),
Description = character(),
AIC = numeric(),
BIC = numeric(),
C_Index = numeric(),
stringsAsFactors = FALSE
)
# Function to calculate and store metrics for each model
for (i in seq_along(models)) {
model <- models[[i]]
# Extract AIC and BIC
aic <- AIC(model)
bic <- BIC(model)
# Add C-index
c_index <- concordance(model)$concordance
# Append results to the data frame
results <- rbind(results, data.frame(
Model = paste("Model", i),
Description = descriptions[i],
AIC = aic,
BIC = bic,
C_Index = round(c_index, 3)
))
}
# Print the table using kable and kableExtra
results %>%
kbl(caption = "Model Evaluation Metrics") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center") %>%
column_spec(1, bold = TRUE) %>%
column_spec(2:5, width = "10em") %>%
kable_styling(position = "center")| Model | Description | AIC | BIC | C_Index |
|---|---|---|---|---|
| Model 1 | Base model | 5860.383 | 5860.383 | 0.500 |
| Model 2 | Treatment | 5741.401 | 5798.993 | 0.674 |
| Model 3 | Full variables | 5741.401 | 5798.993 | 0.674 |
| Model 4 | Stepwise-selected variables | 5737.261 | 5782.511 | 0.672 |
| Model 5 | Stratified | 4567.829 | 4600.739 | 0.674 |
Based on the model evaluation metrics, Model_5 has the smallest AIC and BIC values, and achieved highest concordance (c-index). Although Model_2 and Model_5 have the same concordance, Model_5 is the best model as indicated by the smaller AIC and BIC values when compared to Model_2. 5-fold cross-validation was performed to test the robustness of Model_5.
K-fold cross validation
set.seed(1234)
# Cox model
cox_model <- coxph(Surv(time, status) ~ rx + age + surg + strata(obstruct) + strata(differentiation) + node4 +
local_spread, data = df)
# calculate the original c-index
c_index_original <- concordance(cox_model)
cat("original c-index:", c_index_original$concordance, "\n")original c-index: 0.674316
# create a function for calculating c-index in each fold using concordance()
cox_cindex <- function(train_data, test_data) {
fit <- coxph(Surv(time, status) ~ rx + age + surg + strata(obstruct) + strata(differentiation) + node4 + local_spread, data = train_data)
# Calculate concordance on test data
c_index <- concordance(fit, newdata = test_data)$concordance
return(c_index)
}
# perform 5-fold cross-validation with stratification
K <- 5
folds <- createFolds(c(df$status, df$differentiation, df$rx), k = K, list = TRUE, returnTrain = TRUE)
cv_c_indices <- sapply(folds, function(train_indices) {
train_data <- df[train_indices, ]
test_data <- df[-train_indices, ]
cox_cindex(train_data, test_data) # use the concordance function inside cox_cindex
})
# Print cross-validated c-indices
print(cv_c_indices) Fold1 Fold2 Fold3 Fold4 Fold5
0.7156051 0.6605045 0.6562016 0.6477106 0.6943820
# cross-validation c-indices
cat("cross-validated c-Indices for each fold:", cv_c_indices, "\n")cross-validated c-Indices for each fold: 0.7156051 0.6605045 0.6562016 0.6477106 0.694382
cat("mean cross-validated c-Index:", mean(cv_c_indices), "\n")mean cross-validated c-Index: 0.6748808
# plot cross-validation c-indices
plot(cv_c_indices, type = "b", xlab = "Fold", ylab = "c-index", main = "c-index across folds")Model_5 c-index (0.674) and mean cross-validation c-index (0.675) is very similar, suggesting the the final stratified model is stable and is not overfitting.
IV. Conclusions
- Treatment with Levamisole + 5-FU decreases the hazard of death from colon cancer by 29.5% (HR =0.71, 95% CI: 0.56-0.89; p=0.0035).
- Having more than 4 tumor positive lymph nodes significantly increases the hazard of death by 150.3% (p<0.0001).
- Having a long wait period from surgery to registration for trial is associated with an increase in the hazard by 99.5% (p=0.01).
- Patients with local tumor spread to the submucosa, muscle and serosa have a reduction in the hazard by 76% (p=0.0007), 66% (p<0.001), and 43% (0.004), respectively, compared to those with contiguous organ spread.
- The concordance of the model (0.67) indicates moderate predictive accuracy for survival time.
- Other variables that were not included in the study may contribute to survival time.